Structured knowledge modeling and extraction from images

ABSTRACT

Techniques and systems are described to model and extract knowledge from images. A digital medium environment is configured to learn and use a model to compute a descriptive summarization of an input image automatically and without user intervention. Training data is obtained to train a model using machine learning in order to generate a structured image representation that serves as the descriptive summarization of an input image. The images and associated text are processed to extract structured semantic knowledge from the text, which is then associated with the images. The structured semantic knowledge is processed along with corresponding images to train a model using machine learning such that the model describes a relationship between text features within the structured semantic knowledge and image features of the images. Once the model is learned, the model is usable to process input images to generate a structured image representation of the image.

RELATED APPLICATIONS

This Application claims priority to U.S. Provisional Patent Application No. 62/254,143, filed Nov. 11, 2015, and titled "Structured Knowledge Modeling and Extraction from Images," the disclosure of which is incorporated by reference in its entirety.

BACKGROUND

Image searches involve the challenge of matching text in a search request with text associated with the image, e.g., tags and so forth. For example, a creative professional may capture an image and associate tags having text that is used to locate the image. On the other side, a user trying to locate the image in an image search enters one or more keywords. Accordingly, this requires that the creative professional and the user reach agreement as to how to describe the image using text, such that the user can locate the image and the creative professional can make the image available to the user. Conventional tag and keyword search techniques may be prone to error, misunderstandings, and different interpretations due to this requirement that a common understanding be reached between the creative professional and the user in order to locate the images.

Further, conventional search techniques for images do not support high precision semantic image search due to limitations of conventional image tagging and search. This is because conventional techniques merely associate tags with the images, but do not define relationships between the tags or with the image itself. As such, conventional search techniques cannot achieve accurate search results for complex search queries, such as "a man feeding a baby in a high chair with the baby holding a toy." Consequently, these conventional search techniques force users to navigate through tens, hundreds, and even thousands of images, oftentimes using multiple search requests, in order to locate a desired image. This is caused by forcing the user in conventional techniques to gain an understanding as to how the creative professional described the image for location as part of the search.

SUMMARY

Techniques and systems to extract and model structured knowledge from images are described. In one or more implementations, a digital medium environment is configured to learn and use a model to compute a descriptive summarization of an input image automatically and without user intervention. Training data (e.g., images and unstructured text such as captions) is first obtained to train a model using machine learning in order to generate a structured image representation that serves as the descriptive summarization of an input image.

The images and associated text are then processed to extract structured semantic knowledge from the text, which is then associated with the images. Structured semantic knowledge may take a variety of forms, such as <subject, attribute> and <subject, predicate, object> tuples that function as a statement linking the subject to the object via the predicate. This may include association with the image as a whole and/or objects within the image through a process called "localization."

The structured semantic knowledge is then processed along with corresponding images to train a model using machine learning such that the model describes a relationship between text features within the structured semantic knowledge (e.g., subjects and objects) and image features of images, e.g., portions of the image defined in bounding boxes that include the subjects or objects.

Once the model is learned, the model is then usable to process input images to generate a structured image representation of the image. The structured image representation may include text that is structured in a way that describes relationships between objects in the image and the image itself. The structured image representation may be used to support a variety of functionality, including image searches, automatic caption and metadata generation, object tagging, and so forth.

This Summary introduces a selection of concepts in a simplified form that are further described below in the Detailed Description. As such, this Summary is not intended to identify essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.

BRIEF DESCRIPTION OF THE DRAWINGS

The detailed description is described with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in different instances in the description and the figures may indicate similar or identical items. Entities represented in the figures may be indicative of one or more entities and thus reference may be made interchangeably to single or plural forms of the entities in the discussion.

FIG. 1 is an illustration of an environment in an example implementation that is operable to employ knowledge extraction techniques from images as described herein.

FIG. 2 depicts another example of an image from which knowledge is extracted using a knowledge extraction system of FIG. 1.

FIG. 3 depicts a system showing the knowledge extraction system of FIG. 1 in greater detail.

FIG. 4 depicts an example implementation showing an extractor module of FIG. 3 in greater detail.

FIG. 5 depicts an example system in which an extractor module of FIG. 4 is shown as including localization functionality as part of knowledge extraction.

FIG. 6 depicts an example of localization of structured semantic knowledge to portions of images.

FIG. 7 depicts an example implementation showing a model training module of FIG. 3 in greater detail as employing a machine learning module to model a relationship between the structured semantic knowledge and images.

FIG. 8 depicts an example implementation showing training of a model using a two module machine learning system.

FIG. 9 is a flow diagram depicting a procedure in an example implementation in which a digital medium environment is employed to extract knowledge from an input image automatically and without user intervention.

FIG. 10 is a flow diagram depicting a procedure in an example implementation in which a digital medium environment is employed to extract knowledge and localize text features to image features of an input image.

FIG. 11 depicts a system for structured fact image embedding.

FIG. 12 depicts Model 1 and Model 2 as part of machine learning.

FIG. 13 illustrates an example system including various components of an example device that can be implemented as any type of computing device as described and/or utilized with reference to FIGS. 1-12 to implement embodiments of the techniques described herein.

DETAILED DESCRIPTION

Overview

Techniques and systems are described that support knowledge extraction from an image in order to generate a descriptive summarization of the image, which may then be used to support image search, automatic generation of captions and metadata for the image, and a variety of other uses. The descriptive summarization, for instance, may describe qualities of the image as a whole as well as attributes, objects, and interaction of the objects, one to another, within the image as further described below. Accordingly, although examples involving image searches are described in the following, these techniques are equally applicable to a variety of other examples such as automated structured image tagging, caption generation, and so forth.

Training data is first obtained to train a model using machine learning in order to generate a structured image representation. Techniques are described herein in which training data is obtained that uses images and associated text (e.g., captions of the images, which include any type of text configuration that describes a scene captured by the image) that may be readily obtained from a variety of sources. The images and associated text are then processed automatically and without user intervention to extract structured semantic knowledge from the text, which is then associated with the images. This may include association with the image as a whole and/or objects within the image through a process called "localization" in the following. Use of this training data differs from conventional techniques that rely on crowd sourcing in which humans manually label images, which can be expensive, prone to error, and inefficient.

In one example, structured semantic knowledge is extracted from the text using natural language processing. Structured semantic knowledge may take a variety of forms, such as <subject, attribute> and <subject, predicate, object> tuples that function as a statement linking the subject to the object via the predicate. The structured semantic knowledge is then processed along with corresponding images to train a model using machine learning such that the model describes a relationship between text features within the structured semantic knowledge (e.g., subjects and objects) and image features of images, e.g., portions of the image defined in bounding boxes that include the subjects or objects. In one example, the model is a joint probabilistic model that is built without requiring reduction of a large vocabulary of individual words to a small pre-defined set of concepts, and as such the model may directly address this large vocabulary, which is not possible using conventional techniques.

For example, localization techniques may be employed such that the structured semantic knowledge is mapped to corresponding objects within an image. A <baby, holding, toy> tuple, for instance, may explicitly map the subject "baby" in an image to the object "toy" in the image using the predicate "holding" and thus provides a structure to describe "what is going on" in the image. This is not possible in conventional unstructured tagging techniques, which were not explicit in that a correspondence between a particular object in the image and a tag could not be established, such that no distinction was made when multiple objects of the same type (e.g., multiple babies) were included in the image. Thus, the explicit, structured knowledge provided by the techniques described herein may be leveraged in a way that is searchable by a computing device.

If one searches for images of a "red flower," for instance, a conventional bag-of-words approach considers "red" and "flower" separately, which may return images of flowers that are not red, but have red elsewhere in the image. By contrast, the techniques described herein recognize from the structure of the search request that a user is looking for the concept <flower, red>, which is then used to locate images having a corresponding structure. In this way, the model may achieve increased accuracy over techniques that rely on description of the image as a whole, as further described in relation to FIGS. 5 and 6 in the following.

Further, this mapping may employ a common vector space that penalizes differences such that similar semantic concepts are close to each other within this space. For example, this may be performed for feature vectors for text such that "curvy road" and "winding road" are relatively close to each other in the vector space. Similar techniques are usable to promote similar concepts for image vectors as well as to adapt the image and text vectors to each other. A variety of machine learning techniques may be employed to train the model to perform this mapping. In one such example, a two column deep network is used to learn the correlation between the structured semantic information and an image or portion of an image, e.g., bounding box, an example of which is shown in FIG. 8.

Once the model is learned, the model is then usable to process input images to generate a structured image representation of the image through calculation of a confidence value to describe which text best corresponds with the image. The model, for instance, may loop over bounding boxes of parts of the image to determine which structured text (e.g., <flower, red>) likely describes that part of the image, such as objects, attributes, and relationships therebetween, through calculation of probabilities (i.e., the confidence values) that the structured text describes the same concept as image features in the image. In this way, the structured image representation provides a descriptive summary of the image that uses structured text to describe the image and portions of the image. The structured image representation may thus be calculated for an image to include text that is structured in a way that describes objects in the image (e.g., flower), attributes of the object (e.g., red), relationships between these (e.g., <flower, red>, <baby, holding, toy>), and the image itself as described above. The structured image representation may be used to support a variety of functionality, including image searches, automatic caption and metadata generation, automated object tagging, and so forth. Further discussion of these and other examples is included in the following sections.
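To make this loop concrete, the following is a minimal Python sketch of the confidence-value computation. The names `propose_regions`, `crop_features`, `embed_tuple`, and `score` are hypothetical stand-ins for a region proposer, an image-feature extractor, a text-tuple embedder, and the learned joint model; none of these names appear in this document.

```python
# Minimal sketch, assuming trained components are passed in as callables.
def extract_structured_representation(image, candidate_tuples,
                                      propose_regions, crop_features,
                                      embed_tuple, score, threshold=0.5):
    """Return (region, tuple, confidence) triples above a threshold."""
    representation = []
    for box in propose_regions(image):           # loop over bounding boxes
        x = crop_features(image, box)            # image features for the box
        for tup in candidate_tuples:             # e.g. ("flower", "red", None)
            t = embed_tuple(tup)                 # text features for the tuple
            confidence = score(t, x)             # P(tuple, region)
            if confidence >= threshold:
                representation.append((box, tup, confidence))
    return representation
```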

In the following discussion, an example environment is first described that may employ the knowledge extraction techniques described herein. Example procedures are then described which may be performed in the example environment as well as other environments. Consequently, performance of the example procedures is not limited to the example environment and the example environment is not limited to performance of the example procedures.

Example Environment

FIG. 1 is an illustration of an environment 100 in an example implementation that is operable to employ knowledge extraction techniques described herein. The illustrated environment 100 includes a computing device 102, which may be configured in a variety of ways.

The computing device 102, for instance, may be configured as a desktop computer, a laptop computer, a mobile device (e.g., assuming a handheld configuration such as a tablet or mobile phone as illustrated), wearables, and so forth. Thus, the computing device 102 may range from full resource devices with substantial memory and processor resources (e.g., personal computers, game consoles) to a low-resource device with limited memory and/or processing resources (e.g., mobile devices). Additionally, although a single computing device 102 is shown, the computing device 102 may be representative of a plurality of different devices, such as multiple servers utilized by a business to perform operations "over the cloud" as further described in relation to FIG. 13.

The computing device 102 is illustrated as including a knowledge extraction system 104 that is representative of functionality to form a structured image representation 106 from an image 108 that descriptively summarizes the image 108. The structured image representation 106 is usable to support a variety of functionality, such as to be employed by an image search module 110 to search a database 112 of images 114 based on corresponding structured image representations. As previously described, other uses of the structured image representation 106 are also contemplated, such as automatic generation of captions and metadata for images as represented by a caption generation system 118. Additionally, although the knowledge extraction system 104, image search module 110, and database 112 are illustrated as implemented using computing device 102, this functionality may be further divided "over the cloud" via a network 116 as further described in relation to FIG. 13.

The structured image representation 106 provides a set of concepts with structure that describes a relationship between entities included in the concepts. Through this, the structured image representation may function as an intermediate representation of the image 108 using text to describe not only "what is included" in the image 108 but also a relationship, one to another, of entities and concepts included in the image 108. This may be used to support a higher level of semantic precision in an image search that is not possible using conventional techniques that relied on unstructured tags.

A high precision semantic image search, for instance, involves finding images with the specific content requested in a textual search query. For example, a user may input a search query of a "man feeding baby in high chair with the baby holding a toy" to an image sharing service to locate an image of interest that is available for licensing. Conventional techniques that relied on unstructured tags, however, are not able to accurately satisfy this query. In practice, conventional image searches provide images that typically satisfy some, but not all, of the elements in the query due to this lack of structure, such as a man feeding a baby but the baby is not holding a toy, a baby in a high chair but there is no man in the picture, a picture of a woman feeding a baby holding a toy, and so forth.

A structured image representation 106, however, provides an explicit representation of what is known about an image 108. This supports an ability to determine which concepts in a search query are missing in a searched database image and thus improve accuracy of search results. Accordingly, a measure of similarity between the search query and an image 114 in a database 112 can incorporate which and how many concepts are missed. Also, if there is an image that is close to satisfying the query but misses a concept, techniques may be employed to synthesize a new image using the close image and content from another image that contains the missing concept as further described in the following.

Consider an example of use of the structured image representation 106 in which the extracted knowledge of the image 108 includes the following:

{<man, smiling>, <baby, smiling>, <baby, holding, toy>, <man, sitting at, table>, <baby, sitting in, high chair>, <man, feeding, baby>, <baby, wearing, blue clothes>}.

The caption generation system 118 is configured to use this extracted knowledge to generate a caption as follows: "A man is feeding a smiling baby while the baby holds a toy. The baby is sitting in a high chair. The man is happy too. It is probably a dad feeding his son. The dad and his son are having fun together while mom is away."

Thus, the explicit representation of knowledge of the structured image representation 106 allows for a multiple sentence description of the scene of the image 108 as a caption in this example that is formed automatically and without user intervention. The first two sentences are a straightforward inclusion of the concepts <man, feeding, baby>, <baby, holding, toy>, and <baby, sitting in, high chair>. The third sentence involves reasoning based on the concepts <man, smiling> and <baby, smiling> to deduce by the caption generation system 118 that the man is happy and to add the "too" because both the baby and man are smiling. The fourth sentence also uses reasoning on the extracted concept that the baby is wearing blue to deduce that the baby is a boy.

The caption generation system 118 may also use external statistical knowledge, e.g., that most of the time when a man is feeding a baby boy, it is a father feeding his son. The generated fourth sentence above is tempered with "It is probably . . . " because statistics may indicate a reasonable amount of uncertainty in that deduction and because there may also be uncertainty in the deduction that the baby is a boy because the baby is wearing blue clothes. Since the structured image representation 106 may be used to extract all relevant information about the scene, the absence of information may also be used as part of deductions performed by the caption generation system 118. In this case, the structured image representation 106 does not mention a woman as being present in the image 108. Thus, the caption generation system 118 may deduce that the "mom is away" and, combined with the concepts that the man and baby are smiling, generate the final sentence "The dad and his son are having fun together while mom is away."

Note that a caption generation system 118 may avoid use of some of the extracted information. In this case, the caption did not mention that the man was sitting at the table because the caption generation system 118 deemed that concept uninteresting or unimportant in describing the scene or that it could be deduced with high probability from another concept such as that the baby is sitting in a high chair. This reasoning is made possible through use of the structured image representation 106 as a set of structured knowledge that functions as a descriptive summarization of the image 108 using text.

The structured image representation 106 may also include part-of-speech (POS) tags such as singular noun, adjective, adverb, and so on for the extracted subjects, predicates, actions, attributes, and objects. The part-of-speech tags can be used as part of reasoning as described above as well as slot filling in a grammar-based caption generation approach, and to ensure that a valid sentence is generated as further described below.

Additionally, explicit extraction of knowledge of images 108 at the level of objects within the image 108 and corresponding attributes and interactions allows for further reasoning about middle and higher level scene properties. The deductions about the baby being a boy, the man being happy, and the dad and son having fun while mom is away are examples.

FIG. 2 depicts another example of an image 200. In this example, the structured image representation 106 may include the following knowledge that is extracted from the image 200:

{<soccer ball>, <person 1, wearing, blue shirt>, <person 2, wearing, red shirt>, <person 3, wearing, red shirt>, <person 4, wearing, red shirt>, <person 5, wearing, blue shirt>, <person 6, wearing, blue shirt>, <field>, <person 5, kicking, soccer ball>, <person 6, running>, <person 4, chasing, person 5>, <person 3, running>, <person 1, running>}.

The existence of a soccer ball indicates that the people are playing soccer, which is further supported by knowledge that one of the people is kicking the soccer ball. That there are only two different color shirts indicates that there are two teams playing a game. This is backed up by the knowledge that a person in red is actually chasing the person in blue that is kicking the ball, and that other people are running on a field. From this extracted object level knowledge, scene level properties may be deduced by the caption generation system 118 with enhanced object level descriptions, such as "A soccer match between a team in red and a team in blue."

Further reasoning and deduction about scenes and their constituent objects and actions may also be achieved by building a knowledge base about the content of images, where the knowledge base is then used by a reasoning engine. The construction of a knowledge base, for instance, may take as an input structured knowledge describing images such as <subject, attribute, ->, <subject, predicate, object>, <subject, -, ->, and <-, action, ->. Input data for constructing the knowledge base can be taken from existing image caption databases and image captions and surrounding text in documents. The ability of the techniques described herein to extract such knowledge from any image allows the image knowledge base to include much more data from uncaptioned and untagged images, which is most images. The image knowledge base and corresponding reasoning engine can make deductions such as those needed in the man feeding baby captioning example above. The image knowledge base can also provide the statistics to support the probabilistic reasoning used in that example, such as deducing that the man is likely the baby's father. If the example had included an attribute like <man, old>, then a more likely deduction may include that the man is the baby's grandfather.

Having described examples of an environment in which a structured image representation 106 is used to descriptively summarize images 114, further discussion of operation of the knowledge extraction system 104 to generate and use a model as part of knowledge extraction from images is included in the following.

FIG. 3 depicts a system 300 in an example implementation showing the knowledge extraction system 104 of FIG. 1 in greater detail. In this example, the knowledge extraction system 104 employs a machine learning approach to generate the structured image representation 106. Accordingly, training data 302 is first obtained by the knowledge extraction system 104 that is to be used to train the model that is then used to form the structured image representation 106. Conventional techniques that are used to train models in similar scenarios (e.g., image understanding problems) rely on users to manually tag the images to form the training data 302, which may be inefficient, expensive, time-consuming, and prone to error. In the techniques described herein, however, the model is trained using machine learning using techniques that are performable automatically and without user intervention.

In the illustrated example, the training data 302 includes images 304 and associated text 306, such as captions or metadata associated with the images 304. An extractor module 308 is then used to extract structured semantic knowledge 310, e.g., "<Subject, Attribute>, Image" and "<Subject, Predicate, Object>, Image", using natural language processing as further described in relation to FIG. 4. Extraction may also include localization of the structured semantic knowledge 310 to objects within the image as further described in relation to FIGS. 5 and 6.

The images 304 and corresponding structured semantic knowledge 310 are then passed to a model training module 312. The model training module 312 is illustrated as including a machine learning module 314 that is representative of functionality to employ machine learning (e.g., neural networks, convolutional neural networks, and so on) to train the model 316 using the images 304 and structured semantic knowledge 310. The model 316 is trained to define a relationship between text features included in the structured semantic knowledge 310 with image features in the images as further described in relation to FIG. 7.

The model 316 is then used by a structured logic determination module 318 to generate a structured image representation 106 for an input image 108. The structured image representation 106, for instance, may include text that is structured to define concepts of the image 108, even in instances in which the image 108 does not have text. Rather, the model 316 is usable to generate this text as part of the structured image representation 106, which is then employed by the structured image representation use module 320 to control a variety of functionality, such as image searches, caption and metadata generation, and so on automatically and without user intervention. Having described example modules and functionality of the knowledge extraction system 104 generally, the following discussion includes a description of these modules in greater detail.

FIG. 4 depicts an example implementation 400 showing the extractor module 308 of FIG. 3 in greater detail. The extractor module 308 includes a natural language processing module 402 that is representative of functionality to use natural language processing (NLP) for semantic knowledge extraction from free-form (i.e., unstructured) text 306 associated with images 304 in the training data 302. Such free-form descriptions are readily available in existing image caption databases and documents with images such as web pages and PDF documents and thus the natural language processing module 402 may take advantage of this availability, which is not possible using conventional manual techniques. However, manual techniques may also be employed in which a worker generates text 306 captions for images 304 to describe the images 304.

The structured semantic knowledge 310 is configurable in a variety of ways as previously described, such as "<subject, attribute>, image" 406 and/or "<subject, predicate, object>, image" 408 tuples. Examples of captions and structured knowledge tuples as performed by the extractor module 308 include "A boy is petting a dog while watching TV," which is then extracted as "<boy, petting, dog>, <boy, watching, tv>." In another example, a caption "A brown horse is eating grass in a big green field" is then extracted as "<horse, brown>, <field, green>, <horse, eating, grass>, <horse, in, field>."
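As a hedged illustration of this extraction step, the following Python sketch uses dependency parsing with spaCy (an assumed library choice; the document does not name a specific NLP backend) to pull <subject, predicate, object> and <subject, attribute> tuples out of a caption. A production extractor would handle many more constructions, e.g., prepositional relations such as <horse, in, field>.

```python
# Minimal sketch; requires: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_tuples(caption):
    """Extract <subject, predicate, object> and <subject, attribute> tuples."""
    tuples = []
    for token in nlp(caption):
        # <subject, predicate, object> from a verb with nominal subject/object.
        if token.pos_ == "VERB":
            subjects = [c for c in token.children if c.dep_ == "nsubj"]
            objects = [c for c in token.children if c.dep_ == "dobj"]
            for s in subjects:
                for o in objects:
                    tuples.append((s.lemma_, token.lemma_, o.lemma_))
        # <subject, attribute> from adjectival modifiers.
        if token.dep_ == "amod":
            tuples.append((token.head.lemma_, token.lemma_))
    return tuples

print(extract_tuples("A brown horse is eating grass in a big green field"))
# e.g. [('horse', 'brown'), ('horse', 'eat', 'grass'),
#       ('field', 'big'), ('field', 'green')]
```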

A variety of tuple extraction solutions may be employed by the natural language processing module 402. Additionally, in some instances a plurality of tuple extraction techniques may be applied to the same image caption and consensus used among the techniques to correct mistakes in tuples, remove bad tuples, and identify high confidence tuples or assign confidences to tuples. A similar technique may be employed in which a tuple extraction technique is used to perform tuple extraction jointly on a set of captions for the same image and consensus used to correct mistakes in tuples, remove bad tuples, and identify high confidence tuples or assign confidences to tuples. This data is readily available from existing databases as images oftentimes have multiple captions. Additionally, inputs obtained from crowd sourcing may also be used to confirm good tuples and to remove bad tuples.

In one or more implementations, abstract meaning representation (AMR) techniques are used by the natural language processing module 402 to aid in tuple extraction. AMR is aimed at achieving a deeper semantic understanding of free-form text. Although it does not explicitly extract knowledge tuples of the form <Subject, Attribute> or <Subject, Predicate, Object>, a tuple representation may be extracted from an AMR output. Additionally, knowledge tuples may be extracted from a scene graph (e.g., a Stanford Scene Graph dataset), which is a type of image representation for capturing object attributes and relationships for use in semantic image retrieval.

FIG. 5 depicts an example system 500 in which the extractor module 308 of FIG. 4 is shown as including localization functionality as part of knowledge extraction. In addition to extraction of structured semantic knowledge 310 to describe an image as a whole as part of the training data 302, structured semantic knowledge 310 may also be localized within an image to promote efficient and correct machine learning.

If there is a complex scene with a man walking a dog, for instance, then the structured semantic knowledge 310 may be configured as "<man, walking, dog>, image data" with the image data referring to a portion of the image 304 that includes the man walking the dog, which is referred to as a bounding box 504 in the following. Thus, tuples of the structured semantic knowledge 310 may refer to portions within the image, examples of which are represented as "<subject, attribute>, portion" 506 and "<subject, predicate, object>, portion" 508.

Accordingly, this may promote accuracy in training and subsequent use for images having multiple entities and corresponding actions. For example, if an entire image is captioned and the image includes multiple concepts in addition to the man walking the dog (e.g., a woman jogging or a boy climbing a tree), then any machine learning performed will be confronted with a determination of which part of the image is actually correlated with <man, walking, dog>. Therefore, the more the structured semantic knowledge 310 is localized, the easier it will be to fit a high quality model that correlates images and structured text by the model training module 312. The problem of associating parts of a textual description with parts of an image is also called "grounding."

The grounding and localization module 502 may employ a variety of techniques to perform localization. In one example, object detector and classifier modules that are configured to identify particular objects and/or classify objects are used to process portions of images 304. A region-CNN (convolutional neural network) or a semantic segmentation technique may also be used to localize objects in an image.

In another example, structured semantic knowledge 310 tuples such as <Subject, Attribute> and <Subject, Predicate, Object> and localized objects are identified by calculating how many class occurrences have been localized for the subject and object classes as further described below. This may also include identifying subjects or objects that indicate that the tuple describes an entire scene, in which case the entire training image 304 is associated with the tuple of the structured semantic knowledge 310. To do so, an external list of scene types is used, e.g., bathroom.

Before the grounding and localization module 502 can look up the bounding boxes for an object class mentioned in the subject or object of a tuple, the text used for the subject or object is mapped to a pre-defined subset of database objects since bounding boxes are typically stored according to those class labels. For example, the mapping problem may be solved from subject or object text "guy" to a pre-defined class such as "man" by using a hierarchy to perform the matching.
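A hedged sketch of this mapping using the WordNet hypernym hierarchy follows. WordNet is one possible choice of hierarchy; the document only requires that some hierarchy perform the matching, and the class set here is a placeholder.

```python
# Minimal sketch; requires: import nltk; nltk.download("wordnet")
from nltk.corpus import wordnet as wn

DATABASE_CLASSES = {"man", "woman", "dog", "car"}  # assumed pre-defined classes

def map_to_class(word):
    """Walk the hypernym paths of `word` until a known class label is hit."""
    for synset in wn.synsets(word, pos=wn.NOUN):
        for path in synset.hypernym_paths():
            for node in reversed(path):  # most specific node first
                lemmas = {l.name() for l in node.lemmas()}
                match = lemmas & DATABASE_CLASSES
                if match:
                    return match.pop()
    return None

print(map_to_class("guy"))  # expected: "man"
```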

Once a set of bounding boxes 504 in an image 304 for the subject and object classes in a <Subject, Predicate, Object> triple or the bounding boxes 504 for a <Subject, Attribute> double are obtained, rules and heuristics are then employed by the grounding and localization module 502 to localize a tuple of the structured semantic knowledge 310 within the training image 304. In a first such example, for a <Subject, Attribute> tuple, if there is only a single occurrence of a subject class in the image 304 (e.g., just one car), then the tuple is associated with the single bounding box for that subject since the bounding box 504 contains the subject and the attribute describes the subject within that box, e.g., "<car, shiny>."

For a <Subject, Predicate, Object> tuple with only a single occurrence of the subject class and one occurrence of the object class, the tuple is associated with the smallest rectangular image area that covers the bounding box for the subject and the bounding box for the object, i.e., the bounding box of the two bounding boxes. For example, if there is a single person and a single dog in the image, then <person, walking, dog> is localized to the person and dog bounding boxes. This likely contains the leash connecting the person and dog. In general, the tacit assumption here is that the predicate relating the subject and object is visible near the subject and object.

For a <Subject, Predicate, Object> tuple with a singular subject and a singular object ("car" not "cars") and more than one occurrence of either the subject class or the object class, the following is determined. If a nearest pair of bounding boxes 504, with one from the subject class and one from the object class, is within a threshold distance, then this tuple is associated with the bounding box of the nearest pair of bounding boxes. The assumption here is that the relationship between a subject and object can be well localized visually. The distribution of the distances between each of the pairs may also be used to determine if there is uncertainty in this choice because of a second or third pair that also has a small distance.
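A minimal sketch of these rules follows, assuming axis-aligned boxes in (x0, y0, x1, y1) form; the distance threshold is an assumed parameter, as the document does not specify a value.

```python
import math

def union_box(a, b):
    """Smallest rectangle covering both boxes (the box of the two boxes)."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def center_distance(a, b):
    ax, ay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    bx, by = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    return math.hypot(ax - bx, ay - by)

def localize_spo(subject_boxes, object_boxes, threshold=200.0):
    """Return the image area to associate with an <S, P, O> tuple, or None."""
    if len(subject_boxes) == 1 and len(object_boxes) == 1:
        return union_box(subject_boxes[0], object_boxes[0])
    # Multiple occurrences: take the nearest subject/object pair, but only
    # if that pair is within the threshold distance.
    pairs = [(center_distance(s, o), s, o)
             for s in subject_boxes for o in object_boxes]
    dist, s, o = min(pairs)
    return union_box(s, o) if dist <= threshold else None
```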

The above heuristics give examples of types of information considered in localization. Additional techniques may also be used to aid localization performed by the grounding and localization module 502. An example of this is illustrated by a text semantic module 510 that is representative of functionality to use text understanding to aid in grounding subjects and objects in the image. In one example, positional attributes associated with a subject are used to select or narrow down the correct bounding box for that subject. If there are several cars in a scene, for instance, but the caption states "There is a child sitting on the hood of the leftmost car," then the text semantic module 510 may aid in selecting the bounding box with the minimum horizontal coordinate to ground as the leftmost car in this caption and in the <child, sitting on, car> tuple extracted from it. Instead of using the bounding box of all bounding boxes for cars in the example above, the bounding box of just the grounded car or of the subset of cars that match the "leftmost" criterion may be used. This determination may be generalized to other criteria that may be measured, such as color.

In grounding a tuple, the grounding and localization module 502 first reduces a set of bounding boxes for the subject and the object using their attributes to filter out bounding boxes 504 that do not include these attributes. Such attributes include position, color, and proximity to other identifiable regions, e.g., for "the car on the grass" the grass region is discoverable using a semantic segmentation algorithm.

Relative positional information is also used to select the correct pair of subject class and object class bounding boxes for a positional relationship. For example, if the caption is "A baby sits on top of a table," then the baby and table are grounded to rectangles in the image with the baby rectangle above the table rectangle. As such, this uniquely identifies the image area to associate with this tuple if there are multiple babies and/or multiple tables in the scene.

For a <Subject, Predicate, Object> tuple with the subject and object grounded in the image, the tuple is associated with the smallest rectangular image area that covers the bounding box for the subject and the bounding box for the object. A variety of other examples are also contemplated, such as to add an amount of context to bounding boxes through inclusion of a larger area than would otherwise be included in a "tight" bounding box.

FIG. 6 depicts an example implementation 600 of localization between portions of an image 108 and structured semantic knowledge 310. As illustrated, a bounding box 602 for "<man, sitting on, chair>" includes the man and the chair. A bounding box 604 for "<man, feeding, baby>" includes both the man and the baby. A bounding box 606 for "<baby, holding, toy>" includes the baby and the toy. Having described extraction of structured semantic knowledge 310, the following includes discussion of use of this extracted structured semantic knowledge 310 to train a model 316 by the model training module 312.

FIG. 7 depicts an example implementation 700 showing the model training module 312 in greater detail as employing a machine learning module 314 to model a relationship between the structured semantic knowledge 310 that was extracted from the text 306 and the images 304. In this example, the machine learning module 314 is configured to model a relationship 702 between text features 704 of the structured semantic knowledge 310 with image features of the images 304 of the training data 302 in order to train the model 316.

The model 316, for instance, may be formed as a joint probabilistic model having the form "P(<Subject, Attribute>, Image I)" and "P(<Subject, Predicate, Object>, Image I)." The model 316 is built in this example to output a probability that image "I" and structured text <Subject, Attribute> or <Subject, Predicate, Object> represent the same real world concept visually and textually. The model 316 in this example is configured to generalize well to unseen or rarely seen combinations of subjects, attributes, predicates, and objects, and does not require explicit reduction of a large vocabulary of individual words to a small, pre-defined set of concepts through use of identification of latent concepts and matching of this structure as further described below.

The model 316, once trained, is configured to locate images based on structured text by computing probabilities that images correspond to the structured text. For instance, a text-based image search involves mapping a text query (e.g., represented as a set of structured knowledge using a natural language tuple extraction technique) to an image. This is supported by a joint model as further described in relation to FIG. 8 by looping over images "I" and checking which gives a high probability "P(structured text <S,P,O>, image I)" for a given concept <S,P,O>. Knowledge extraction/tagging is supported by looping over possible concepts <S,P,O> and checking which gives a high probability "P(structured text <S,P,O>, image I)" for a given image or image portion "I."
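Both retrieval directions reduce to scoring loops over the joint model, sketched below with a hypothetical `probability(concept, image)` function standing in for "P(structured text <S,P,O>, image I)"; the function name and thresholds are illustrative only.

```python
def search_images(query_tuple, images, probability, k=10):
    """Text-based image search: rank images by P(<S,P,O>, I) for the query."""
    ranked = sorted(images, key=lambda im: probability(query_tuple, im),
                    reverse=True)
    return ranked[:k]

def tag_image(image, vocabulary_tuples, probability, threshold=0.5):
    """Knowledge extraction/tagging: keep high-probability concepts for I."""
    return [t for t in vocabulary_tuples if probability(t, image) >= threshold]
```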

There are two parts to forming the model 316 by the model training module 312. The first is to generate a feature representation for the structured text "<S,P,O>," "<S,A,->," and "<S,-,->" (where "-" indicates an unused slot to represent all concepts as triples) and for images. The second part is to correlate the feature representation of the structured text with image features, e.g., to correlate a text feature "t" 704 and an image feature "x" 706 via "P(t,x)." For example, this defines a relationship 702 of text features (e.g., <man, holding, ball>) with image features 706 of an image that shows a ball being held by a man. These two parts are further described in greater detail below.

The structured semantic knowledge 310 “<S,P,O>” and “<S,A>” tuples areconfigured such that similar structured knowledge concepts have nearbyand related representations, e.g., as vectors in a vector space. Thissupports generalization and use of a large vocabulary. For example, textfeature 704 representations of “<road, curvy>” and “<road, winding>” areconfigured to be similar and the representations between “<dog,walking>” and “<person, walking>” are related by the common action ofwalking. This may be performed such that similar words are nearby in thespace and the vector space captures some relationships between words.For example, vec(“man”)+(vec(“queen”)−vec(“woman”))=vec(“king”).

The model training module 312 may also be configured to build upon semantic vector representations of single words to develop a vector representation of knowledge tuples which captures the relationship between two concepts "<S1,P1,O1>" and "<S2,P2,O2>." Specifically, a feature vector is built for an "<S,P,O>" triple as a function of single word representations "vec(S)," "vec(P)," and "vec(O)." The "vec(<S,P,O>)" is built as a concatenation of the individual word vectors: "vec(<S,P,O>)=[vec(S) vec(P) vec(O)]."

When an “<S,P,O>” element is missing, such as the object “0” whenrepresenting a “<Subject, Attribute>” or both a predicate “P” and object“0” when representing a “<Subject>,” the corresponding vector slot isfilled using zeros. Thus the vector representation for a subject,solely, lies along the “S” axis in “S,P,O” space. Visual attributes maybe addressed as modifiers for an unadorned subject that move therepresentation of “<S,P>” into the “SP” plane of “S,P,O” space. Anotheroption involves summing the vector representations of the individualwords.

For a compound “S” or “P” or “O,” the vector representation for eachindividual word in the phrase is averaged to insert a single vector intoa target slot of a “[vec(S) vec(P) vec(O)]” representation. For example,“vec(“running toward”)” is equal to“0.5*(vec(“running”)+vec(“toward”)).” A non-uniform weight average mayalso be used when some words in the phrase carry more meaning thanothers. In an implementation, a semantic representation (e.g., vector orprobability distribution) is learned directly for compound phrases suchas “running toward” or “running away from” by treating these phrasesatomically as new vocabulary elements in an existing semantic wordembedding model.

There are a variety of choices of techniques that are usable to capture semantics of image features 706. In one such example, a deep machine learning network is used that has a plurality of levels of features that are learned directly from the data. In particular, convolutional neural networks (CNNs) with convolution, pooling, and activation layers (e.g., rectified linear units that threshold activity) have been proven for image classification. Examples include AlexNet, VGGNet, and GoogLeNet.

Additionally, classification features from deep classification nets have been shown to give high quality results on other tasks (e.g., segmentation), especially after fine tuning these features for the other task. Thus, starting from features learned for classification and then fine tuning these features for another image understanding task may exhibit increased efficiency in terms of training compared to starting training from scratch for a new task. For the reasons above, CNN features are adopted as fixed features in a baseline linear CCA model. The machine learning module 314 then fine tunes the model 316 from a CNN in a deep network for correlating text and image features 704, 706.

The machine learning module 314 is configured to map text features "t" 704 and image features "x" 706 into a common vector space and penalize differences in the mapped features when the same or similar concepts are represented by "t" and "x."

One technique that may be leveraged to do so includes a linear mapping referred to as Canonical Correlation Analysis (CCA), which is applied to text and image features 704, 706. In CCA, matrices "T" and "X" are discovered that map feature vectors "t" and "x," respectively, into a common vector space: "t′=Tt" and "x′=Xx." If the mapping is performed into a common space of dimension "D," and "t" is a vector in "D_t-dimensional space," and "x" is a vector in "D_x-dimensional space," then "T" is a "(D by D_t)" matrix, "X" is a "(D by D_x)" matrix, and the mapped representations t′ and x′ are D-dimensional vectors.

Loss functions may be employed for model fitting using training pairs "(t,x)" based on a squared Euclidean distance "|t′−x′|_2^2" or a cosine similarity "dot_product(t′,x′)" or the angle "between(t′,x′)," which removes the vector length from the cosine similarity measure. When the dot product is used, then the CCA correlation function is expressed as follows:

f(t,x) = f_CCA_dp(t,x) = tr(Tt)*Xx = tr(t)*M*x = sum_{i,j} t_i M_{ij} x_j,

where "tr" equals transpose, "M = tr(T)*X" is "(D_t by D_x)," and subscripts indicate vector components. This form supports a faster than exhaustive search for images or text given the other. For example, in text-based image search, images with feature vectors "x" are found such that "dotprod(v,x)" is large, where "v = tr(t)*M."

For a squared Euclidean loss, the CCA correlation function may be expressed as follows:

f(t,x) = f_CCA_E(t,x) = |Tt − Xx|_2^2.

Again, the simple closed form of the correlation function above may also support faster than exhaustive search for images or text given the other. For example, in text-based image search, images with feature vectors "x" are found such that "f_CCA_E(t,x)" is small for a given text vector "t." Given "(T,X)" from fitting the CCA model and the query "t," linear algebra provides a set of vectors that minimize "f(t,x)" and images are found with feature vector "x" close to this set.
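For illustration, the following sketch fits a baseline linear CCA model with scikit-learn on random stand-in features and ranks images for a text query by dot product in the common space. The dimensionalities are toy values, not values from this document.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
T_feats = rng.standard_normal((200, 90))    # text tuple vectors t (D_t = 90)
X_feats = rng.standard_normal((200, 256))   # CNN image features x (D_x = 256)

# Fit the mappings T, X into a common D-dimensional space (D = 32 here).
cca = CCA(n_components=32, max_iter=1000)
cca.fit(T_feats, X_feats)
t_prime, x_prime = cca.transform(T_feats, X_feats)

# Text-based image search: rank images by dot_product(t', x') for a query.
q = t_prime[0] / np.linalg.norm(t_prime[0])
xn = x_prime / np.linalg.norm(x_prime, axis=1, keepdims=True)
ranking = np.argsort(-(xn @ q))  # indices of best-matching images first
```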

FIG. 8 depicts an example of a deep network 800 for correlating text and images as part of machine learning. The deep network 800 includes a text machine learning module 802 (e.g., a column) and an image machine learning module 804 (e.g., a column that is separate from the column that implements the text machine learning module 802) that are configured to learn the correlation "f(<S,P,O>,I)" between structured semantic knowledge "<S,P,O>" and an image or image portion "I" by non-linear mapping into a common space.

The text machine learning module 802 starts with a semantic text vector representation "t" that includes vec(S) 806, vec(P) 808, and vec(O) 810, which is then passed through sets of fully connected and activation layers 812 to output a non-linear mapping t→t′ as a feature vector for the text 814.

The image machine learning module 804 is configured as a deep convolutional neural network 814 (e.g., AlexNet, VGGNet, or GoogLeNet with the final layers mapping to class probabilities removed) that starts from image pixels of the image 816 and outputs a feature vector x′ for the image. The image machine learning module 804 is initialized as the training result of an existing CNN and the image features are fine tuned to correlate images with structured text capturing image attributes and interactions instead of just object class discrimination as in the existing CNN.

Adaptation layers 822, 824 in the text and image machine learning modules 802, 804 adapt the representations according to a non-linear function to map them into a common space with image features representing the same concept. A loss layer 828 joins the modules and penalizes differences in the outputs t′ and x′ of the text and image machine learning modules 802, 804 to encourage mapping into a common space for the same concept.

A discriminative loss function such as a ranking loss may be used to ensure that mismatched text and images have smaller correlation or larger distance than correctly matched text and images. For example, a simple ranking loss function may require correlations "dot_prod(t_i′,x_i′) > dot_prod(t_j′,x_i′)" for a training example "(t_i,x_i)" where the original tuple for training tuple "t_j" does not match training image "x_i." A ranking loss may also use a semantic text similarity or an external object hierarchy such as ImageNet to formulate the loss to non-uniformly penalize different mismatches.
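A hedged PyTorch sketch of this two-column arrangement follows: a text column maps t = [vec(S), vec(P), vec(O)] through fully connected and activation layers, an image column starts from a pre-trained CNN with its class-probability layer removed, and a margin ranking loss penalizes mismatched pairs relative to matched ones. Layer sizes, the margin, and the negative-sampling scheme are assumptions, not values from this document.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class TextColumn(nn.Module):
    """Fully connected + activation layers mapping t = [vec(S), vec(P), vec(O)]."""
    def __init__(self, in_dim=900, embed_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.ReLU(),
            nn.Linear(1024, embed_dim),          # adaptation layer into common space
        )

    def forward(self, t):
        return nn.functional.normalize(self.net(t), dim=1)

class ImageColumn(nn.Module):
    """Pre-trained CNN with the class-probability layer removed, plus adaptation."""
    def __init__(self, embed_dim=512):
        super().__init__()
        cnn = models.alexnet(weights="DEFAULT")  # initialized from an existing CNN
        cnn.classifier = cnn.classifier[:-1]     # drop the final class layer
        self.cnn = cnn
        self.adapt = nn.Linear(4096, embed_dim)  # adaptation layer into common space

    def forward(self, img):
        return nn.functional.normalize(self.adapt(self.cnn(img)), dim=1)

text_col, image_col = TextColumn(), ImageColumn()
t_prime = text_col(torch.randn(8, 900))           # t -> t'
x_prime = image_col(torch.randn(8, 3, 224, 224))  # image pixels -> x'

# Ranking loss: matched correlation must exceed a mismatched one by a margin.
matched = (t_prime * x_prime).sum(dim=1)                     # dot_prod(t_i', x_i')
mismatched = (t_prime.roll(1, dims=0) * x_prime).sum(dim=1)  # dot_prod(t_j', x_i')
loss = torch.clamp(mismatched - matched + 0.2, min=0).mean()
```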

Other loss functions and architectures are possible, for example with fewer or more adaptation layers between the semantic text representation "t=[vec(S),vec(P),vec(O)]" and the embedding space t′, or with connections between text and image layers before the common embedding space. In one example, a wild card loss that ignores the object part of embedding vectors for second order facts <S, P> and the predicate and object parts of embedding vectors for first order facts <S> is also possible.

Returning again to FIG. 3, at this point structured semantic knowledge 310 is obtained by the model training module 312 to solve the problem of extracting a concept relevant to an image region. The modeling above is now applied for "P(Concept <S,P,O>, Image I)" to extract all high probability concepts about a portion of an image. This may be performed without choosing the most probable concept. For example, consider an image region that contains a smiling man who is wearing a blue shirt. Image pixel data "I" for this region will have high correlation with both "<man, smiling>" and "<man, wearing, blue shirt>" and thus both these concepts may be extracted for the same image region.

The knowledge extraction task may be solved by applying the above model with image pixel data from regions identified by an object proposal algorithm or object regions identified by the R-CNN algorithm, or even in a sliding window approach that more densely samples image regions. To capture object interactions, bounding boxes are generated from pairs of object proposals or pairs of R-CNN object regions. One approach is to try all pairs of potential object regions to test for possible interactions. Another approach is to apply some heuristics to be more selective, such as to not examine pairs that are distant in the image. Since the model may be applied to extract zero, one, or more high probability concepts about an image region, the extracted <S,P,O> concepts may be localized to image regions that provide the corresponding visual data.
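A small sketch of generating candidate interaction regions from pairs of proposals, skipping pairs that are distant in the image per the heuristic above; it reuses the hypothetical `union_box` and `center_distance` helpers from the localization sketch earlier, and the distance cutoff is an assumed parameter.

```python
def interaction_regions(proposals, max_distance=250.0):
    """Candidate boxes for <S, P, O> interactions from pairs of proposals."""
    regions = list(proposals)                    # single-object candidates
    for i, a in enumerate(proposals):
        for b in proposals[i + 1:]:
            if center_distance(a, b) <= max_distance:
                regions.append(union_box(a, b))  # region covering both objects
    return regions
```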

Example Procedures

The following discussion describes knowledge extraction techniques that may be implemented utilizing the previously described systems and devices. Aspects of each of the procedures may be implemented in hardware, firmware, or software, or a combination thereof. The procedures are shown as a set of blocks that specify operations performed by one or more devices and are not necessarily limited to the orders shown for performing the operations by the respective blocks. In portions of the following discussion, reference will be made to FIGS. 1-8.

FIG. 9 depicts a procedure 900 in an example implementation in which a digital medium environment is employed to extract knowledge from an input image automatically and without user intervention. A digital medium environment is described to learn a model that is usable to compute a descriptive summarization of an input image automatically and without user intervention. Training data is obtained that includes images and associated text (block 902). The training data 302, for instance, may include images 304 and unstructured text 306 that is associated with the images 304, e.g., as captions, metadata, and so forth.

Structured semantic knowledge is extracted from the associated text using natural language processing by the at least one computing device, the structured semantic knowledge describing text features (block 904). The structured semantic knowledge 310, for instance, may be extracted using natural language processing to generate tuples, such as <subject, attribute>, <subject, predicate, object>, and so forth.

A model is trained using the structured semantic knowledge and the images as part of machine learning (block 906). A model training module 312, for instance, may train a neural network using the images 304 and structured semantic knowledge 310. This knowledge may also be localized as described in greater detail in relation to FIG. 10.

The model is used to form a structured image representation of the input image that explicitly correlates at least part of the text features with image features of the input image as the descriptive summarization of the input image (block 908). The structured image representation, for instance, may correlate concepts in the text with portions of the images along with addressing a structure of the knowledge to describe "what is going on" in the images as a descriptive summarization. This descriptive summarization may be employed in a variety of ways, such as to locate images as part of an image search, perform automated generation of captions, and so on.

FIG. 10 depicts a procedure 1000 in an example implementation in which a digital medium environment is employed to extract knowledge and localize text features to image features of an input image. A digital medium environment is described to learn a model that is usable to compute a descriptive summarization of an object within an input image automatically and without user intervention. Structured semantic knowledge is extracted from text associated with images using natural language processing by the at least one computing device (block 1002). Image features of objects within respective said images are localized as corresponding to the text features of the structured semantic knowledge (block 1004). As before, structured semantic knowledge 310 is extracted. However, in this case this knowledge is localized to particular portions of the image and thus may improve accuracy of subsequent modeling by potentially differentiating between multiple concepts in an image, e.g., the baby holding the toy and the man feeding the baby as shown in FIG. 1.

A model is trained using the localized image and text features as part of machine learning (block 1006). A variety of different techniques may be used, such as to perform probabilistic modeling. The model is used to form a structured image representation of the input image that explicitly correlates at least one of the textual features with at least one image feature of the object included in the input image (block 1008). For example, the structured logic determination module 318 may take an input image 108 and form a structured image representation 106, especially in instances in which the input image 108 does not include associated text. Further, the structured image representation 106 may be localized to correlate concepts included in the text and image to each other. As before, the structured image representation 106 may be used to support a variety of functionality, such as image searches, automated caption generation, and so forth.

Implementation Example

FIG. 11 depicts an example system 1100 usable to perform structured fact image embedding. This system 1100 supports properties such as an ability to (1) be continuously fed with new facts without changing the architecture, (2) learn with wild cards to support all types of facts, (3) generalize to unseen or otherwise not-directly observable facts, and (4) allow two-way retrieval, such as to retrieve relevant facts in a language view given an image and to retrieve relevant images given a fact in a language view. This system 1100 aims to model structured knowledge in images as a problem having views in the visual domain V and the language domain L. Let "f" be a structured "fact" (i.e., concept) and let "f_(l)ϵL" denote the view of "f" in the language domain. For instance, an annotated fact with language view "f_(l) = <S: girl, P: riding, O: bike>" would have a corresponding visual view "f_(ν)" as an image where the fact occurs, as shown in FIG. 11.

The system is configured to learn a representation that covers first-order facts <S> (objects), second-order facts <S, P> (actions and attributes), and third-order facts <S, P, O> (interaction and positional facts). These types of facts are represented as an embedding problem into a "structured fact space." The structured fact space is configured as a learned representation having three hyper-dimensions, which are denoted as follows:

φ_S ∈ ℝ^{d_S}, φ_P ∈ ℝ^{d_P}, and φ_O ∈ ℝ^{d_O}

The embedding functions from a visual view of a fact f_v into φ_S, φ_P, and φ_O are denoted, respectively, as:

φ_S^v(f_v), φ_P^v(f_v), and φ_O^v(f_v)

Similarly, the embedding functions from a language view of a fact f_l into φ_S, φ_P, and φ_O are denoted, respectively, as:

φ_S^l(f_l), φ_P^l(f_l), and φ_O^l(f_l)

The concatenation of the visual view hyper-dimensions' embedding is denoted as φ^v(f_v), and the concatenation of the language view hyper-dimensions' embedding is denoted as φ^l(f_l), where the above are the visual embedding and the language embedding of "f", respectively, thereby forming:

φ^v(f_v) = [φ_S^v(f_v), φ_P^v(f_v), φ_O^v(f_v)],
φ^l(f_l) = [φ_S^l(f_l), φ_P^l(f_l), φ_O^l(f_l)]
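Concretely, either view's fact-space embedding is nothing more than the concatenation of its three part embeddings. A toy sketch follows; the dimension sizes and random part embeddings are placeholders, not values taken from this description:

import numpy as np

d_S, d_P, d_O = 300, 300, 300      # assumed hyper-dimension sizes

# Part embeddings of one fact, e.g., as produced by the visual encoders
phi_S = np.random.randn(d_S)
phi_P = np.random.randn(d_P)
phi_O = np.random.randn(d_O)

# The fact-space embedding concatenates the three hyper-dimensions
phi_v = np.concatenate([phi_S, phi_P, phi_O])
assert phi_v.shape == (d_S + d_P + d_O,)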

Thus, as is apparent from the above, third-order facts <S, P, O> can be directly embedded into the structured fact space by φ^v(f_v) for the image view and by φ^l(f_l) for the language view.

First-order facts are facts that indicate an object, like <S: person>. Second-order facts are more specific about the subject, e.g., <S: person, P: playing>. Third-order facts are even more specific, e.g., <S: person, P: playing, O: piano>. In the following, higher-order facts are defined as lower-order facts with an additional modifier applied. For example, adding the modifier "P: eating" to the fact <S: kid> constructs the fact <S: kid, P: eating>. Further, applying the modifier "O: ice cream" to the fact <S: kid, P: eating> constructs the fact <S: kid, P: eating, O: ice cream>. Similarly, attributes may be addressed as modifiers to a subject, e.g., applying "P: smiling" to the fact <S: baby> constructs the fact <S: baby, P: smiling>.
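Continuing the earlier StructuredFact sketch (again with hypothetical helper names), a modifier simply fills the next empty slot of a lower-order fact:

def apply_modifier(fact: StructuredFact, part: str, value: str) -> StructuredFact:
    """Construct a higher-order fact by filling the P or O slot."""
    if part == "P":
        return StructuredFact(fact.subject, value, fact.obj)
    if part == "O":
        return StructuredFact(fact.subject, fact.predicate, value)
    raise ValueError("a modifier must target 'P' or 'O'")

kid = StructuredFact("kid")                          # <S: kid>
kid_eating = apply_modifier(kid, "P", "eating")      # <S: kid, P: eating>
kid_eating_ice_cream = apply_modifier(kid_eating, "O", "ice cream")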

Based on the fact modifier observation above, both first- and second-order facts may be represented with wild cards, as illustrated in the following equations for first-order and second-order facts, respectively:

φ^v(f_v) = [φ_S^v(f_v), φ_P^v(f_v) = *, φ_O^v(f_v) = *],
φ^l(f_l) = [φ_S^l(f_l), φ_P^l(f_l) = *, φ_O^l(f_l) = *]

φ^v(f_v) = [φ_S^v(f_v), φ_P^v(f_v), φ_O^v(f_v) = *],
φ^l(f_l) = [φ_S^l(f_l), φ_P^l(f_l), φ_O^l(f_l) = *]

Setting φ_P and φ_O to "*" for first-order facts is interpreted to mean that the "P" and "O" modifiers are not of interest for first-order facts. Similarly, setting φ_O to "*" for second-order facts indicates that the "O" modifier is not of interest for single-frame actions and attributes.

Both first- and second-order facts are named wild-card facts. Since modeling structured facts in visual data potentially allows logical reasoning over facts from images, the described problem is also referenced as a "Sherlock" problem in the following.

In order to train a machine learning model that connects the structured fact language view in L with its visual view in V, data is collected in the form of (f_v, f_l) pairs. Data collection for large-scale problems has become increasingly challenging, especially in the below examples, as the model relies on localized association of a structured language fact f_l with an image f_v when such facts occur. In particular, it is a complex task to collect annotations, especially for second-order facts <S, P> and third-order facts <S, P, O>. Also, multiple structured language facts may be assigned to the same image, e.g., <S: man, P: smiling> and <S: man, P: wearing, O: glasses>. If these facts refer to the same man, the same image example could be used to learn about both facts.

As previously described, techniques are discussed in which fact annotations are automatically collected from datasets that come in the form of image/caption pairs. For example, a large quantity of high-quality facts may be obtained from caption datasets using natural language processing. Since caption writing is free-form, these descriptions are typically readily available, e.g., from social networks, preconfigured databases, and so forth.

In the following example, a two-step automatic annotation process is described: (i) fact extraction from captions, which include any text associated with an image that describes the image; and (ii) fact localization in images. First, the captions associated with the given image are analyzed to extract sets of clauses that are considered as candidate <S, P> and <S, P, O> facts in the image. Clauses form facts but are not necessarily facts by themselves.
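As a rough sketch of step (i), candidate clauses may be pulled from a dependency parse of each caption. The example below uses spaCy, which is an assumed toolkit choice rather than one named by this description, and handles only the simplest subject-verb-object and participial patterns; a production extractor would need to cover far more constructions:

import spacy

nlp = spacy.load("en_core_web_sm")

def candidate_clauses(caption: str):
    """Yield candidate <S, P> and <S, P, O> clauses from one caption."""
    doc = nlp(caption)
    for token in doc:
        if token.pos_ != "VERB":
            continue
        subjects = [t for t in token.lefts if t.dep_ in ("nsubj", "nsubjpass")]
        if not subjects and token.dep_ == "acl":
            subjects = [token.head]        # e.g., "a man feeding a baby"
        objects = [t for t in token.rights if t.dep_ in ("dobj", "attr")]
        for s in subjects:
            if objects:
                for o in objects:
                    yield (s.text, token.lemma_, o.text)   # candidate <S, P, O>
            else:
                yield (s.text, token.lemma_, None)         # candidate <S, P>

print(list(candidate_clauses("A man is feeding a baby in a high chair.")))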

Captions can provide rich amounts of information to image understanding systems. However, developing natural language processing systems to accurately and completely extract structured knowledge from free-form text is challenging due to (1) spelling and punctuation mistakes; (2) word sense ambiguity within clauses; and (3) a spatial preposition lexicon that may include hundreds of terms such as "next to," "on top of," as well as collection phrase adjectives such as "group of," "bunch of," and so forth.

The process of localizing facts in an image is constrained by information in the dataset. For example, a database may contain object annotations for different objects across training and validation sets. This allows first-order facts to be localized for objects using bounding box information. In order to locate higher-order facts in images, visual entities are defined as any noun that is either a dataset object or a noun in a predefined ontology that is an immediate or indirect hypernym of one of the objects. It is expected that visual entities appear either in the S or the O part, if it exists, for a candidate fact f_l, which allows for the localization of facts for images. Given a candidate third-order fact, an attempt is first made to assign each "S" and "O" to one of the visual entities. If "S" and "O" are not visual entities, then the clause is ignored. Otherwise, the clauses are processed by several heuristics. The heuristics, for instance, may take into account whether the subject or the object is singular or plural, or a scene. For example, in the fact <S: men, P: chasing, O: soccer ball>, the techniques described herein may identify that "men" may involve a union of multiple candidate bounding boxes, while for "soccer ball" it is expected that there is a single bounding box.
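The plural/singular heuristic can be sketched as follows; box coordinates are (x1, y1, x2, y2), and the helper names are illustrative rather than drawn from this description:

from typing import List, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2)

def union_box(boxes: List[Box]) -> Box:
    """Smallest rectangle containing every candidate box."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))

def ground_entity(candidate_boxes: List[Box], plural: bool) -> Box:
    # "men" -> union of multiple candidate boxes;
    # "soccer ball" -> a single box is expected
    return union_box(candidate_boxes) if plural else candidate_boxes[0]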

A straightforward way to model facts in images is to learn a classifier for each separate fact. However, there is a clear scalability limitation in this technique, as the number of facts is significant, e.g., |S| × |P| × |O|, where |S|, |P|, and |O| are the number of subjects, predicates, and objects, respectively. Thus, this number could reach millions for possible facts in the real world. In addition to scalability problems, this technique discards semantic relationships between facts, which is a significant property that allows generalization to unseen facts or facts with few examples. For instance, during training there might be a second-order fact like <S: boy, P: playing> and first-order facts like <S: girl> and <S: boy>. At run time, the model trained using the techniques described herein understands an image with the fact <S: girl, P: playing> even if this fact is not seen during training, which is clearly not captured by learning a model for each fact in the training.

Accordingly, a two-view embedding problem is described in this example that is used to model structured facts. For example, a structured fact embedding model may include (1) two-way retrieval (i.e., retrieve relevant facts in a language view given an image, and retrieve relevant images given a fact in a language view); and (2) support for wild-card facts, i.e., first- and second-order facts.

The first property is satisfied in this example by using a generative model p(f_v, f_l) that connects the visual and the language views of "f." This technique first models the following:

p(f_v, f_l) ∝ s(φ^v(f_v), φ^l(f_l))

where s(·,·) is a similarity function defined over the structured fact space denoted by "S", which is a discriminative space of facts. This is performed such that two views of the same fact are embedded close to each other.
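One plausible instantiation, offered only as a sketch (the text requires merely that two views of the same fact embed close together), takes s(·,·) to be a negative squared Euclidean distance and performs two-way retrieval by nearest-neighbor search in the fact space:

import numpy as np

def similarity(phi_v: np.ndarray, phi_l: np.ndarray) -> float:
    """s(.,.): one possible choice of similarity in the fact space."""
    return -float(np.sum((phi_v - phi_l) ** 2))

def retrieve_facts(phi_v, fact_bank, k=5):
    """Given an image embedding, rank (name, embedding) language facts."""
    scored = [(name, similarity(phi_v, emb)) for name, emb in fact_bank]
    return sorted(scored, key=lambda p: p[1], reverse=True)[:k]

Retrieving images given a language fact proceeds symmetrically, ranking image embeddings against φ^l(f_l).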

To model and train φ^v(f_v), a CNN encoder is used, and to train φ^l(f_l), an RNN encoder is used. Two models are proposed for learning facts, denoted by Model 1 and Model 2 in an example implementation 1200 of FIG. 12. Models 1 and 2 share the same structured fact language embedding and encoder but differ in the structured fact image encoder.

This process starts by defining an activation operator ψ(θ, α), where "α" is an input and "θ" is a series of one or more neural network layers, which may include different layer types such as four convolution layers, one pooling layer, and another convolution and pooling layer. The operator ψ(θ, α) applies the "θ" parameters layer by layer to compute the activation of the "θ" subnetwork given "α". The operator ψ(·,·) is used to define the Model 1 and Model 2 structured fact image encoders.

In Model 1, a structured fact is visually encoded by sharing convolutional layer parameters (denoted by θ_v^c) and fully connected layer parameters (denoted by θ_v^u). Then, the transformation matrices W_S^v, W_P^v, and W_O^v are applied to produce φ_S^v(f_v), φ_P^v(f_v), and φ_O^v(f_v) as follows:

φ_S^v(f_v) = W_S^v ψ(θ_v^u, ψ(θ_v^c, f_v)),
φ_P^v(f_v) = W_P^v ψ(θ_v^u, ψ(θ_v^c, f_v)),
φ_O^v(f_v) = W_O^v ψ(θ_v^u, ψ(θ_v^c, f_v))
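A minimal PyTorch rendering of this shared-trunk design is sketched below; the layer sizes are placeholders and not the architecture of FIG. 12:

import torch
import torch.nn as nn

class Model1ImageEncoder(nn.Module):
    """Shared conv/fc trunk; separate W_S, W_P, W_O heads."""
    def __init__(self, d: int = 300):
        super().__init__()
        self.trunk = nn.Sequential(      # theta_v^c and theta_v^u, shared
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(64 * 4 * 4, 512), nn.ReLU(),
        )
        self.W_S = nn.Linear(512, d)     # W_S^v
        self.W_P = nn.Linear(512, d)     # W_P^v
        self.W_O = nn.Linear(512, d)     # W_O^v

    def forward(self, f_v: torch.Tensor):
        h = self.trunk(f_v)              # psi(theta_v^u, psi(theta_v^c, f_v))
        return self.W_S(h), self.W_P(h), self.W_O(h)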

In contrast to Model 1, different convolutional layers are used in Model 2 for "S" than for "P" and "O," consistent with the above discussion that "P" and "O" are modifiers to "S." Starting from f_v, there is a common set of convolutional layers, denoted by θ_v^c0; the network then splits into two branches, producing two sets of convolutional layers θ_v^cS and θ_v^cPO followed by two sets of fully connected layers θ_v^uS and θ_v^uPO. Finally, φ_S^v(f_v), φ_P^v(f_v), and φ_O^v(f_v) are computed by the transformation matrices W_S^v, W_P^v, and W_O^v as follows:

φ_S^v(f_v) = W_S^v ψ(θ_v^uS, ψ(θ_v^cS, ψ(θ_v^c0, f_v))),
φ_P^v(f_v) = W_P^v ψ(θ_v^uPO, ψ(θ_v^cPO, ψ(θ_v^c0, f_v))),
φ_O^v(f_v) = W_O^v ψ(θ_v^uPO, ψ(θ_v^cPO, ψ(θ_v^c0, f_v)))
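The corresponding sketch for Model 2 differs only in the branched trunk (again with placeholder sizes):

import torch
import torch.nn as nn

class Model2ImageEncoder(nn.Module):
    """Common trunk theta_v^c0, then separate S and PO branches."""
    def __init__(self, d: int = 300):
        super().__init__()
        self.common = nn.Sequential(                     # theta_v^c0
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        def branch() -> nn.Sequential:                   # conv + fc branch
            return nn.Sequential(
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(4), nn.Flatten(),
                nn.Linear(64 * 4 * 4, 512), nn.ReLU(),
            )
        self.branch_S = branch()                         # theta_v^cS, theta_v^uS
        self.branch_PO = branch()                        # theta_v^cPO, theta_v^uPO
        self.W_S = nn.Linear(512, d)
        self.W_P = nn.Linear(512, d)
        self.W_O = nn.Linear(512, d)

    def forward(self, f_v: torch.Tensor):
        h = self.common(f_v)
        h_S, h_PO = self.branch_S(h), self.branch_PO(h)
        return self.W_S(h_S), self.W_P(h_PO), self.W_O(h_PO)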

In both models, a structured language fact is encoded using RNN word embedding vectors for "S, P, and O." Hence, φ_S^l(f_l) = RNN_θl(f_l^S), φ_P^l(f_l) = RNN_θl(f_l^P), and φ_O^l(f_l) = RNN_θl(f_l^O), where f_l^S, f_l^P, and f_l^O are the subject, predicate, and object parts of f_l ∈ L. For each of these, the literals are dropped, and if any of f_l^S, f_l^P, or f_l^O contains multiple words, the average vector is computed as the representation of that part. The RNN language encoder parameters are denoted by θ^l. In one or more implementations, θ^l is fixed to a pre-trained word vector embedding model for f_l^S, f_l^P, and f_l^O.
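A sketch of the fixed pre-trained variant, assuming word vectors loaded into a plain dict (e.g., from a word2vec or GloVe file; the loading step is omitted):

import numpy as np

def embed_part(part: str, word_vectors: dict, d: int = 300) -> np.ndarray:
    """Average the word vectors of a possibly multi-word part."""
    vecs = [word_vectors[w] for w in part.split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(d)

def embed_language_fact(S, P, O, word_vectors, d=300):
    phi_S = embed_part(S, word_vectors, d)
    phi_P = embed_part(P, word_vectors, d) if P else None   # wild card
    phi_O = embed_part(O, word_vectors, d) if O else None   # wild card
    return phi_S, phi_P, phi_O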

One way to model p(f_v, f_l) for Model 1 and Model 2 is to assume that p(f_v, f_l) ∝ exp(−loss_w(f_v, f_l)) and minimize the distance loss loss_w(f_v, f_l), which is defined as follows:

loss_w(f_v, f_l) = w_S^f · ‖φ_S^v(f_v) − φ_S^l(f_l)‖² + w_P^f · ‖φ_P^v(f_v) − φ_P^l(f_l)‖² + w_O^f · ‖φ_O^v(f_v) − φ_O^l(f_l)‖²

which minimizes the distances between the embedding of the visual view and the language view. A solution to penalize wild-card facts is to ignore the wild-card modifiers in the loss through use of a weighted Euclidean distance, the weighting of which is based on whether corresponding parts of the feature vectors are present, which is called a "wild card" loss. Here w_S^f = 1, w_P^f = 1, and w_O^f = 1 for <S, P, O> facts; w_S^f = 1, w_P^f = 1, and w_O^f = 0 for <S, P> facts; and w_S^f = 1, w_P^f = 0, and w_O^f = 0 for <S> facts. Hence loss_w does not penalize the "O" modifier for second-order facts or the "P" and "O" modifiers for first-order facts, which follows the above definition of a wild-card modifier.
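The wild-card loss can be sketched directly from the definition above; representing each view as a (φ_S, φ_P, φ_O) triple is an assumption of the sketch:

import torch

# (w_S^f, w_P^f, w_O^f) keyed by fact order, as defined above
WEIGHTS = {1: (1.0, 0.0, 0.0),   # <S>
           2: (1.0, 1.0, 0.0),   # <S, P>
           3: (1.0, 1.0, 1.0)}   # <S, P, O>

def wildcard_loss(phi_v, phi_l, order: int) -> torch.Tensor:
    """Weighted Euclidean distance between visual and language views.

    phi_v and phi_l are (phi_S, phi_P, phi_O) triples of tensors; parts
    weighted zero (wild cards) contribute no penalty.
    """
    loss = phi_v[0].new_zeros(())
    for w, pv, pl in zip(WEIGHTS[order], phi_v, phi_l):
        if w > 0.0:
            loss = loss + w * torch.sum((pv - pl) ** 2)
    return loss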

Accordingly, this example describes a problem of associating high-order visual and language facts. A neural network approach is described for mapping visual facts and language facts into a common, continuous structured fact space that allows natural language facts to be associated with images and images to be associated with natural language structured descriptions.

Example System and Device

FIG. 13 illustrates an example system generally at 1300 that includes an example computing device 1302 that is representative of one or more computing systems and/or devices that may implement the various techniques described herein. This is illustrated through inclusion of the knowledge extraction system 104. The computing device 1302 may be, for example, a server of a service provider, a device associated with a client (e.g., a client device), an on-chip system, and/or any other suitable computing device or computing system.

The example computing device 1302 as illustrated includes a processing system 1304, one or more computer-readable media 1306, and one or more I/O interfaces 1308 that are communicatively coupled, one to another. Although not shown, the computing device 1302 may further include a system bus or other data and command transfer system that couples the various components, one to another. A system bus can include any one or combination of different bus structures, such as a memory bus or memory controller, a peripheral bus, a universal serial bus, and/or a processor or local bus that utilizes any of a variety of bus architectures. A variety of other examples are also contemplated, such as control and data lines.

The processing system 1304 is representative of functionality to perform one or more operations using hardware. Accordingly, the processing system 1304 is illustrated as including hardware elements 1310 that may be configured as processors, functional blocks, and so forth. This may include implementation in hardware as an application specific integrated circuit or other logic device formed using one or more semiconductors. The hardware elements 1310 are not limited by the materials from which they are formed or the processing mechanisms employed therein. For example, processors may be comprised of semiconductor(s) and/or transistors (e.g., electronic integrated circuits (ICs)). In such a context, processor-executable instructions may be electronically-executable instructions.

The computer-readable storage media 1306 is illustrated as including memory/storage 1312. The memory/storage 1312 represents memory/storage capacity associated with one or more computer-readable media. The memory/storage component 1312 may include volatile media (such as random access memory (RAM)) and/or nonvolatile media (such as read only memory (ROM), Flash memory, optical disks, magnetic disks, and so forth). The memory/storage component 1312 may include fixed media (e.g., RAM, ROM, a fixed hard drive, and so on) as well as removable media (e.g., Flash memory, a removable hard drive, an optical disc, and so forth). The computer-readable media 1306 may be configured in a variety of other ways as further described below.

Input/output interface(s) 1308 are representative of functionality to allow a user to enter commands and information to computing device 1302, and also allow information to be presented to the user and/or other components or devices using various input/output devices. Examples of input devices include a keyboard, a cursor control device (e.g., a mouse), a microphone, a scanner, touch functionality (e.g., capacitive or other sensors that are configured to detect physical touch), a camera (e.g., which may employ visible or non-visible wavelengths such as infrared frequencies to recognize movement as gestures that do not involve touch), and so forth. Examples of output devices include a display device (e.g., a monitor or projector), speakers, a printer, a network card, tactile-response device, and so forth. Thus, the computing device 1302 may be configured in a variety of ways as further described below to support user interaction.

Various techniques may be described herein in the general context of software, hardware elements, or program modules. Generally, such modules include routines, programs, objects, elements, components, data structures, and so forth that perform particular tasks or implement particular abstract data types. The terms "module," "functionality," and "component" as used herein generally represent software, firmware, hardware, or a combination thereof. The features of the techniques described herein are platform-independent, meaning that the techniques may be implemented on a variety of commercial computing platforms having a variety of processors.

An implementation of the described modules and techniques may be stored on or transmitted across some form of computer-readable media. The computer-readable media may include a variety of media that may be accessed by the computing device 1302. By way of example, and not limitation, computer-readable media may include "computer-readable storage media" and "computer-readable signal media."

"Computer-readable storage media" may refer to media and/or devices that enable persistent and/or non-transitory storage of information in contrast to mere signal transmission, carrier waves, or signals per se. Thus, computer-readable storage media refers to non-signal bearing media. The computer-readable storage media includes hardware such as volatile and non-volatile, removable and non-removable media and/or storage devices implemented in a method or technology suitable for storage of information such as computer readable instructions, data structures, program modules, logic elements/circuits, or other data. Examples of computer-readable storage media may include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, hard disks, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or other storage device, tangible media, or article of manufacture suitable to store the desired information and which may be accessed by a computer.

"Computer-readable signal media" may refer to a signal-bearing medium that is configured to transmit instructions to the hardware of the computing device 1302, such as via a network. Signal media typically may embody computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as carrier waves, data signals, or other transport mechanism. Signal media also include any information delivery media. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media include wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared, and other wireless media.

As previously described, hardware elements 1310 and computer-readable media 1306 are representative of modules, programmable device logic and/or fixed device logic implemented in a hardware form that may be employed in some embodiments to implement at least some aspects of the techniques described herein, such as to perform one or more instructions. Hardware may include components of an integrated circuit or on-chip system, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a complex programmable logic device (CPLD), and other implementations in silicon or other hardware. In this context, hardware may operate as a processing device that performs program tasks defined by instructions and/or logic embodied by the hardware, as well as hardware utilized to store instructions for execution, e.g., the computer-readable storage media described previously.

Combinations of the foregoing may also be employed to implement various techniques described herein. Accordingly, software, hardware, or executable modules may be implemented as one or more instructions and/or logic embodied on some form of computer-readable storage media and/or by one or more hardware elements 1310. The computing device 1302 may be configured to implement particular instructions and/or functions corresponding to the software and/or hardware modules. Accordingly, implementation of a module that is executable by the computing device 1302 as software may be achieved at least partially in hardware, e.g., through use of computer-readable storage media and/or hardware elements 1310 of the processing system 1304. The instructions and/or functions may be executable/operable by one or more articles of manufacture (for example, one or more computing devices 1302 and/or processing systems 1304) to implement techniques, modules, and examples described herein.

The techniques described herein may be supported by various configurations of the computing device 1302 and are not limited to the specific examples of the techniques described herein. This functionality may also be implemented all or in part through use of a distributed system, such as over a "cloud" 1314 via a platform 1316 as described below.

The cloud 1314 includes and/or is representative of a platform 1316 for resources 1318. The platform 1316 abstracts underlying functionality of hardware (e.g., servers) and software resources of the cloud 1314. The resources 1318 may include applications and/or data that can be utilized while computer processing is executed on servers that are remote from the computing device 1302. Resources 1318 can also include services provided over the Internet and/or through a subscriber network, such as a cellular or Wi-Fi network.

The platform 1316 may abstract resources and functions to connect the computing device 1302 with other computing devices. The platform 1316 may also serve to abstract scaling of resources to provide a corresponding level of scale to encountered demand for the resources 1318 that are implemented via the platform 1316. Accordingly, in an interconnected device embodiment, implementation of functionality described herein may be distributed throughout the system 1300. For example, the functionality may be implemented in part on the computing device 1302 as well as via the platform 1316 that abstracts the functionality of the cloud 1314.

CONCLUSION

Although the invention has been described in language specific to structural features and/or methodological acts, it is to be understood that the invention defined in the appended claims is not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as example forms of implementing the claimed invention.

What is claimed is:
1. A method implemented by at least one computing device, the method comprising: obtaining, by the at least one computing device, training data including images and associated text; extracting, by the at least one computing device, a plurality of text features resulting from natural language processing of the associated text of the training data, the plurality of text features corresponding to a subject and an object, respectively, within a respective said image of the training data; generating, by the at least one computing device, a plurality of bounding boxes in the respective said image for at least one said text feature; localizing, by the at least one computing device, the at least one said text feature to a combination of a first said bounding box for the subject and a second said bounding box for the object; adding, by the at least one computing device, an additional area from the respective said image to the combination of the first said bounding box and the second said bounding box; and training, by the at least one computing device, a model using data that includes the at least one said text feature as localized to the combination of the first and second said bounding boxes having the additional area as part of machine learning.
2. The method as described in claim 1, further comprising generating a descriptive summarization of the object of an input image using the model.
3. The method as described in claim 1, wherein the localizing is performed responsive to determining respective distance between the first and second said bounding boxes is within a threshold distance.
4. The method as described in claim 1, wherein the associated text is a caption or metadata of the respective said image.
5. The method as described in claim 1, wherein the plurality of text features are in a form of <subject, predicate, object>.
6. The method as described in claim 1, further comprising: removing at least one of the plurality of the text features from use as part of the training.
7. The method as described in claim 1, further comprising: identifying confidence in the extracting.
8. The method as described in claim 1, wherein the training includes adapting the plurality of text features or the image features one to another, within a vector space.
9. The method as described in claim 1, wherein the model explicitly correlates the image features of an input image with the plurality of text features such that at least one of the image features is explicitly correlated with a first one of the plurality of text features but not a second one of the plurality of text features.
10. The method as described in claim 1, wherein the plurality of text features are explicitly correlated to the image features.
11. A system implemented by at least one computing device comprising: an extractor module to extract a plurality of text features from text associated with images in training data using natural language processing; a grounding and localization module to: generate bounding boxes in a respective said image for at least one text feature of the plurality of text features; determine the bounding boxes include multiple occurrences of a subject or an object of the at least one said text feature; identify relative positional information from the text associated with the at least one said text feature; and ground the at least one said text feature to a combination of a first said bounding box for the subject and a second said bounding box for the object based on the relative positional information, the at least one said text feature is grounded to the combination of the first said bounding box for the subject and the second said bounding box for the object as a smallest rectangular area in the respective said image that includes the first said bounding box and the second said bounding box; and a model training module to train a model using the training data having the grounded at least one said text feature as part of machine learning.
12. The system as described in claim 11, wherein the associated text is unstructured.
13. The system as described in claim 11, wherein the plurality of text features are in a form of a <subject, predicate, object> tuple.
14. The system as described in claim 11, wherein the extractor module is configured to localize at least part of the plurality of text features as corresponding to respective portions within respective said images and as not corresponding to other portions within respective said images.
15. The system as described in claim 11, further comprising a module configured to generate a caption for an input image based on the plurality of text features.
16. The system as described in claim 11, further comprising a use module configured to deduce, based on the plurality of text features, scene properties of an input image using the model.
17. A method implemented by at least one computing device, the method comprising: obtaining, by the at least one computing device, training data including images and associated text; extracting, by the at least one computing device, a plurality of text features using natural language processing from the associated text; generating, by the at least one computing device, bounding boxes in a respective said image for at least one text feature of the plurality of text features; determining, by the at least one computing device, the bounding boxes include multiple occurrences of a subject or an object of the at least one said text feature; identifying, by the at least one computing device, relative positional information from the text associated with the at least one said text feature; grounding, by the at least one computing device, the at least one said text feature to a combination of a first said bounding box for the subject and a second said bounding box for the object based on the relative positional information; adding, by the at least one computing device, an additional area from the respective said image to the combination of the first said bounding box and the second said bounding box; and training, by the at least one computing device, a model using the combination of the first and second said bounding boxes having the additional area for the at least one said text feature.
18. The method as described in claim 17, wherein the combination of the first and second said bounding boxes is a smallest rectangular area in the respective said image that includes the first and second said bounding boxes and the adding includes adding the additional area from the respective said image that is not included in the smallest rectangular area.
19. The method as described in claim 17, wherein the plurality of text features are in a form of a <subject, predicate, object> tuple.